1. INTRODUCTION


The most commonly used methods of clustering and discrimination in
picture-element (pixel) data sets from remote sensing problems involve the
assumption of normality of subsets of the data. Since the distributions are
often very complex, use of this assumption can introduce error. Alternative,
so-called nonparametric techniques may be preferable as a first step. This
memorandum suggests some simple methods for choosing an initial nonparametric
description of such data sets.


2. NONPARAMETRIC DENSITY ESTIMATORS 


For the reasons cited in the Introduction, it is often desirable to have an
estimate of the unknown density function f(x) of an absolutely continuous
random variable. This may be done by ascribing f(x) to a family of functions,
such as the normal family, and then estimating those parameters which
distinguish the members of the family from each other. If one is unwilling to
assert in advance that the underlying distribution for a data set belongs to
some parametric family, then one may use nonparametric estimators of density.
The most common of these is the histogram; but there are others such as
kernel estimators, k-th nearest neighbor estimators, and orthogonal series
estimators. We will concentrate on the first two, although similar principles
would seem to apply to the others mentioned.

The constant width histogram is obtained by partitioning the domain of the 
random variable into intervals of a certain width h. Then, if x falls in
the interval $(x_i, x_i + h)$,

$$ \hat{f}(x) = \frac{1}{nh} \times (\text{the number of data points which fall in } (x_i, x_i + h), \text{ out of a sample of } n) $$

Notice that the appropriate "window width" h remains to be determined. If it 
is too large, features of the distribution may be obscured. If it is too 
small, peculiarities of the particular sample may dominate the estimate. 
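The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `histogram_density`, the `origin` argument, and the use of half-open bins [origin + jh, origin + (j+1)h) are assumptions made here, not part of the memorandum.

```python
import math

def histogram_density(data, h, origin=0.0):
    """Return a function x -> constant-width histogram density estimate."""
    n = len(data)
    counts = {}
    for x in data:
        j = math.floor((x - origin) / h)  # index of the bin containing x
        counts[j] = counts.get(j, 0) + 1

    def f_hat(x):
        j = math.floor((x - origin) / h)
        # (number of sample points in the bin containing x) / (n * h)
        return counts.get(j, 0) / (n * h)

    return f_hat

sample = [0.1, 0.4, 0.45, 0.9, 1.3, 1.35, 1.4, 2.2]
f_hat = histogram_density(sample, h=0.5)
print(f_hat(0.42))  # three of the eight points share the bin [0.0, 0.5)
```

Rerunning the last two lines with a larger or smaller h shows the trade-off described above: wide bins blur features, narrow bins follow the individual sample points.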

Using the criterion of minimum Integrated Mean Squared Error, Scott (1979) has 
determined that the asymptotically optimal value for h is 

$$ h_n^* = \left( \frac{6}{\int f'(x)^2 \, dx} \right)^{1/3} n^{-1/3} $$

The quantity $\int f'(x)^2 \, dx$ remains to be estimated from the data. This
may be difficult for small n and time consuming for large n. Scott proposes
that for a quick choice of h, it be assumed that the distribution is normal,
giving

$$ h_n^* = 3.49 \, s \, n^{-1/3} $$

where s is the sample standard deviation. His empirical studies indicate
that $\int f'(x)^2 \, dx$ is often larger than for a normal distribution, that is,
many distributions of interest are less smooth than the normal.
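As a check on the constant in the normal-reference rule, the sketch below evaluates the asymptotic formula with the integral of f'(x)^2 for a normal density, which is 1 / (4 sqrt(pi) sigma^3). The function names are hypothetical and chosen here for illustration only.

```python
import math
import statistics

def normal_roughness(sigma=1.0):
    """Integral of f'(x)^2 for a normal density with standard deviation sigma."""
    return 1.0 / (4.0 * math.sqrt(math.pi) * sigma ** 3)

def optimal_bin_width(roughness, n):
    """Asymptotically optimal bin width (6 / roughness)^(1/3) * n^(-1/3)."""
    return (6.0 / roughness) ** (1.0 / 3.0) * n ** (-1.0 / 3.0)

def scott_bin_width(data):
    """Normal-reference bin width, with sigma replaced by the sample s."""
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    return optimal_bin_width(normal_roughness(s), len(data))

# With sigma = 1 and n = 1 only the leading constant remains.
print(optimal_bin_width(normal_roughness(1.0), 1))
```

A data set whose density is rougher than the normal has a larger integral, so the same formula would give it a narrower optimal bin.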

The Parzen (1962) kernel density estimator is given by

$$ \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} k\!\left( \frac{x - x_i}{h} \right) $$

where k is a standard density function, called the kernel, symmetric with 
mean zero and variance one. h is once again called the window width, and 
is a measure of how far away from each data point its influence is felt. 
Once again, using the Integrated Mean Squared Error criterion, Epanechnikov 
(1969) established that the optimum choice of h is 

$$ h_n^* = n^{-1/5} \left( \int f''(x)^2 \, dx \right)^{-1/5} \left( \int k^2(x) \, dx \right)^{1/5} $$
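A minimal sketch of the kernel estimator follows. It assumes the uniform kernel, scaled to [-sqrt(3), sqrt(3)] so that it has mean zero and variance one as required above; the names `uniform_kernel` and `kernel_density` are illustrative, and any other variance-one kernel could be passed in instead.

```python
import math

SQRT3 = math.sqrt(3.0)

def uniform_kernel(u):
    """Uniform density on [-sqrt(3), sqrt(3)]: symmetric, mean zero, variance one."""
    return 1.0 / (2.0 * SQRT3) if abs(u) <= SQRT3 else 0.0

def kernel_density(data, h, kernel=uniform_kernel):
    """Return x -> (1 / (n h)) * sum over i of kernel((x - x_i) / h)."""
    n = len(data)

    def f_hat(x):
        return sum(kernel((x - xi) / h) for xi in data) / (n * h)

    return f_hat

sample = [0.0, 0.2, 1.0, 3.0]
f_hat = kernel_density(sample, h=0.5)
print(f_hat(0.1))  # only the points 0.0 and 0.2 lie within h * sqrt(3) of 0.1
```

With this kernel, h * sqrt(3) is the distance beyond which a data point has no influence on the estimate.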

He further established the optimum choice for k, and also noted that the 
choice of kernel makes comparatively little difference to the integrated 
mean squared error. We once again face the problem of estimating a measure
of the smoothness of the density, $\int f''(x)^2 \, dx$, with only the data
to help us. This involves considerable computational and statistical
difficulty. We could follow the lead of the previous section and assume
that a reasonable value of h would be found by assuming the data are normal.
Thus, using the uniform kernel, we get

$$ h_n^* = 1.06412 \, s \, n^{-1/5} $$

This may be a good starting point for the exploratory adjustment of h. The 
arbitrary nature of the choice is still disquieting. 
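The constant in the normal-reference window width can be recovered numerically. The sketch assumes the uniform kernel with variance one, for which the integral of k(x)^2 is 1 / (2 sqrt(3)), and uses the standard value 3 / (8 sqrt(pi)) for the integral of f''(x)^2 under the unit normal. Function names are hypothetical.

```python
import math
import statistics

def optimal_window_constant(kernel_roughness, density_roughness):
    """Leading constant of h* = C * n^(-1/5) for a variance-one density."""
    return (kernel_roughness / density_roughness) ** 0.2

R_UNIFORM_KERNEL = 1.0 / (2.0 * math.sqrt(3.0))   # integral of k(x)^2
R_NORMAL = 3.0 / (8.0 * math.sqrt(math.pi))       # integral of f''(x)^2, N(0, 1)

def normal_reference_window(data):
    """Normal-reference window width C * s * n^(-1/5) for the uniform kernel."""
    s = statistics.stdev(data)
    c = optimal_window_constant(R_UNIFORM_KERNEL, R_NORMAL)
    return c * s * len(data) ** (-0.2)

print(optimal_window_constant(R_UNIFORM_KERNEL, R_NORMAL))
```

A different variance-one kernel changes only `R_UNIFORM_KERNEL`, and because the ratio enters through a fifth root the constant moves little, in line with the remark that the choice of kernel makes comparatively little difference.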
3. SMOOTHEST DISTRIBUTIONS

We will proceed to show that, under certain mild restrictions, there exists
a smoothest distribution from the point of view of the histogram and a 
smoothest distribution from the point of view of the kernel estimators. 
These will immediately provide us with an upper bound to the choice of 
smoothing parameter h. 


Consider distributions f with two continuous derivatives on the whole real
line. We will construct an f with fixed variance that has minimum
$\int f'(x)^2 \, dx$. We will use a variational argument very similar to that
used by Epanechnikov (1969) in constructing the optimal kernel. Assuming f has
variance one, we will consider a slightly varied distribution $f + \delta$ such
that $\int \delta = 0$, $\int x \delta = 0$, $\int x^2 \delta = 0$. Thus
$\int (f' + \delta')^2 - \int f'^2$ will be negligible compared to $\delta$,
so $\int f' \delta' = 0$. Integrating by parts, this becomes
$\int f'' \delta = 0$. Since $\delta$ is arbitrary up to the given restrictions,
f'' must be a quadratic polynomial, and so f is a quartic polynomial. By an
argument very close to Epanechnikov's, the density that minimizes
$\int f'(x)^2 \, dx$ is, up to a change of scale,

$$ f(x) = \begin{cases} \frac{15}{16} (1 - x^2)^2 & \text{on } [-1, 1] \\ 0 & \text{elsewhere} \end{cases} $$

The value of $\int f'(x)^2 \, dx$ for the unit normal distribution is .14105,
whereas for the smoothest distribution with variance one it is .1157.
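The figure of .1157 can be recovered by carrying the variational argument through with an explicit scale. The following is a sketch rather than the memorandum's own derivation: the multipliers $\lambda_0, \lambda_1, \lambda_2$, the support half-width $a$, and the assumptions that the minimizer is symmetric and reaches zero with a continuous first derivative at $\pm a$ are introduced here for illustration.

```latex
% Minimize \int f'(x)^2 dx subject to
%   \int f = 1, \quad \int x f = 0, \quad \int x^2 f = \sigma^2 .
\begin{align*}
\text{Stationarity:}\quad
  & f''(x) = \lambda_0 + \lambda_1 x + \lambda_2 x^2
    \;\Longrightarrow\; f \text{ is a quartic on its support } [-a, a]. \\
\text{Symmetry, } f(\pm a) = f'(\pm a) = 0:\quad
  & f(x) = \frac{15}{16 a^5} \, (a^2 - x^2)^2 . \\
\text{Variance:}\quad
  & \int_{-a}^{a} x^2 f(x) \, dx = \frac{a^2}{7} . \\
\text{Smoothness:}\quad
  & \int_{-a}^{a} f'(x)^2 \, dx = \frac{15}{7 a^3} . \\
\text{Variance one } (a = \sqrt{7}):\quad
  & \int f'(x)^2 \, dx = \frac{15}{7^{5/2}} \approx 0.1157 . \\
\text{Unit normal, for comparison:}\quad
  & \int f'(x)^2 \, dx = \frac{1}{4 \sqrt{\pi}} \approx 0.14105 .
\end{align*}
```

Setting $a = 1$ gives the density on $[-1, 1]$ with variance 1/7; the quoted value refers to the same shape stretched to variance one.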


Since we use $\int f''(x)^2 \, dx$ for smoothness in the expression for the
optimal kernel width, we can do a similar computation. Among all distributions
with a fixed variance, we seek the one for which $\int f''(x)^2 \, dx$ is
minimal. We can derive it just as above, to get, up to a change of scale,

$$ f(x) = \begin{cases} \frac{35}{32} (1 - x^2)^3 & \text{on } [-1, 1] \\ 0 & \text{elsewhere} \end{cases} $$

The value of $\int f''(x)^2 \, dx$ for the unit normal is .21157, and for the
minimizing density with variance one it is .14403.
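The two smoothness values for the kernel case can be checked with exact rational arithmetic. The sketch below represents polynomials as coefficient lists (lowest degree first), integrates (35/32)(1 - x^2)^3 over [-1, 1], and then rescales to variance one, using the fact that the integral of f''(x)^2 scales as the variance to the power 5/2. The helper names are assumptions made for this illustration.

```python
from fractions import Fraction
import math

def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def poly_diff(p):
    """Differentiate a polynomial coefficient list."""
    return [k * c for k, c in enumerate(p)][1:]

def integrate_minus1_to_1(p):
    """Exact integral over [-1, 1]; odd powers integrate to zero."""
    return sum(Fraction(2) * c / (k + 1) for k, c in enumerate(p) if k % 2 == 0)

one_minus_x2 = [Fraction(1), Fraction(0), Fraction(-1)]
cube = poly_mul(poly_mul(one_minus_x2, one_minus_x2), one_minus_x2)
f = [Fraction(35, 32) * c for c in cube]

mass = integrate_minus1_to_1(f)
variance = integrate_minus1_to_1(poly_mul([Fraction(0), Fraction(0), Fraction(1)], f))
d2f = poly_diff(poly_diff(f))
roughness = integrate_minus1_to_1(poly_mul(d2f, d2f))

# Rescale to variance one, then compare with the unit normal value.
roughness_unit = float(roughness) * float(variance) ** 2.5
roughness_normal = 3.0 / (8.0 * math.sqrt(math.pi))
print(mass, variance, roughness, roughness_unit, roughness_normal)
```

On [-1, 1] the density has variance 1/9, which is why the rescaling step is needed before comparing with the unit normal.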
4. APPLICATIONS


We can now construct an upper bound to the bin width h for a histogram in
terms of the sample standard deviation s.

$$ h_{\max} = 3.73 \, s \, n^{-1/3} $$

Any histogram with bin width much wider than this is presumably oversmoothed. 
Notice that this is only seven percent wider than the normal optimum proposed 
by Scott (1979). According to his results on sensitivity to optimum width, 
the maximum width histogram would only be one half of one percent larger 
than the optimum in integrated mean squared error. Thus, the maximum width 
histogram is a good starting point for graphical and iterative methods of 
seeking optimal representation. It carries with it the additional reassurance 
that adjustments in bin width need proceed in only one direction, toward 
smaller values. 
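One way to use this bound in practice is sketched below. The leading constant is recomputed from the smoothness of the smoothest variance-one density, 15 / 7^(5/2) (the .1157 of Section 3), rather than hard-coded. The shrink factor and the number of candidate widths are hypothetical choices for an exploratory search and do not come from the memorandum.

```python
import math
import statistics

R_SMOOTHEST = 15.0 / 7.0 ** 2.5                  # integral of f'(x)^2, smoothest density
R_NORMAL = 1.0 / (4.0 * math.sqrt(math.pi))      # integral of f'(x)^2, unit normal

def bin_width_constant(roughness):
    """Leading constant (6 / roughness)^(1/3) of the optimal bin width."""
    return (6.0 / roughness) ** (1.0 / 3.0)

def max_bin_width(data):
    """Upper bound on the bin width: constant * s * n^(-1/3)."""
    s = statistics.stdev(data)
    return bin_width_constant(R_SMOOTHEST) * s * len(data) ** (-1.0 / 3.0)

def candidate_bin_widths(data, steps=5, shrink=0.8):
    """Widths for a downward search, starting from the upper bound."""
    h = max_bin_width(data)
    return [h * shrink ** k for k in range(steps)]

data = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0, 29.0]
print(bin_width_constant(R_SMOOTHEST) / bin_width_constant(R_NORMAL))
print(candidate_bin_widths(data))
```

The first printed ratio compares the upper bound with the normal-reference width; the list shows that every later candidate is narrower than the starting point, so the search moves in one direction only.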

Similarly, the upper bound for the optimal choice of smoothing parameter h 
for a kernel density estimate is 

$$ h_{\max} = 1.15 \, s \, n^{-1/5} $$

using the uniform kernel and the value of $\int f''(x)^2 \, dx$ for
our smoothest density. Again, it is only slightly larger than the optimal
value for a normal distribution, and so is a plausible initial value in the 
downward search for the optimum of an unknown density. 
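The kernel bound can be sketched the same way. The code assumes the uniform kernel with variance one (integral of k(x)^2 equal to 1 / (2 sqrt(3))) and the smoothest-density value 35/243, which is the .14403 of Section 3; the function names are illustrative only.

```python
import math
import statistics

R_UNIFORM_KERNEL = 1.0 / (2.0 * math.sqrt(3.0))   # integral of k(x)^2
R_SMOOTHEST = 35.0 / 243.0                        # integral of f''(x)^2, smoothest density
R_NORMAL = 3.0 / (8.0 * math.sqrt(math.pi))       # integral of f''(x)^2, unit normal

def window_constant(density_roughness, kernel_roughness=R_UNIFORM_KERNEL):
    """Leading constant of h = C * s * n^(-1/5)."""
    return (kernel_roughness / density_roughness) ** 0.2

def max_window_width(data):
    """Upper bound on the kernel window width for the uniform kernel."""
    s = statistics.stdev(data)
    return window_constant(R_SMOOTHEST) * s * len(data) ** (-0.2)

data = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0, 29.0]
print(window_constant(R_SMOOTHEST), window_constant(R_NORMAL))
print(max_window_width(data))
```

Because the smoothness enters through a fifth root, the bound and the normal-reference constant differ by less than ten percent.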

Clearly this variational approach may be generalized to other density 
estimation procedures in which the optimum value of the smoothing parameter 
depends on some related measure of density smoothness. For example, Terrell 
and Scott (1980) propose an estimator with a higher rate of convergence than 
a Parzen estimator whose optimal smoothing parameter depends upon the 
second and fourth derivative of the density. It should be noted that our 
density-free choices for smoothing parameter may be less useful in the case 
of higher-order methods than the kernel method because of the increasing 
sensitivity of such methods to deviation from the optimal window width. 